The effect of genome graph expressiveness on the discrepancy between genome graph distance and string set distance

Abstract Motivation Intra-sample heterogeneity describes the phenomenon where a genomic sample contains a diverse set of genomic sequences. In practice, the true string sets in a sample are often unknown due to limitations in sequencing technology. In order to compare heterogeneous samples, genome graphs can be used to represent such sets of strings. However, a genome graph is generally able to represent a string set universe that contains multiple sets of strings in addition to the true string set. This difference between genome graphs and string sets is not well characterized. As a result, a distance metric between genome graphs may not match the distance between true string sets. Results We extend a genome graph distance metric, Graph Traversal Edit Distance (GTED) proposed by Ebrahimpour Boroojeny et al., to FGTED to model the distance between heterogeneous string sets and show that GTED and FGTED always underestimate the Earth Mover’s Edit Distance (EMED) between string sets. We introduce the notion of string set universe diameter of a genome graph. Using the diameter, we are able to upper-bound the deviation of FGTED from EMED and to improve FGTED so that it reduces the average error in empirically estimating the similarity between true string sets. On simulated T-cell receptor sequences and actual Hepatitis B virus genomes, we show that the diameter-corrected FGTED reduces the average deviation of the estimated distance from the true string set distances by more than 250%. Availability and implementation Data and source code for reproducing the experiments are available at: https://github.com/Kingsford-Group/gtedemedtest/. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Intra-sample heterogeneity describes the phenomenon where a genomic sample contains a diverse set of genomic sequences. A heterogeneous string set is a set of strings where each string is assigned a weight representing its abundance in the set. Computing the distance between heterogeneous string sets is essentially computing the distance between two distributions of strings. We formulate the problem of heterogeneous sample comparison as the heterogeneous string set comparison problem.
This problem can be used to compare samples where differences can be traced to the differences between sets of genomic sequences. For example, cancer samples are clustered based on differences in their genomic and transcriptomic features (Morris et al., 2016;Zhao et al., 2019) into cancer subtypes that correlate with patient survival rates. The dissimilarities between T-cell receptor (TCR) sequences are computed between individuals to study immune responses (Bolen et al., 2017). Different compositions of these sequences result in different clinical outcomes such as response to treatment.
We point out that the Earth Mover's Distance (EMD) (Rubner et al., 2000), or the Wasserstein distance (Wasserstein et al., 1969), with edit distance as the ground metric is an elegant metric to compare a pair of heterogeneous string sets. Given two distributions of items and a cost to transform one item into another, EMD computes the total cost of transforming one distribution into another. The EMD was initially used in computer vision to compare distributions of pixel values in images (Levina and Bickel, 2001) and later adapted to natural language processing (Kusner et al., 2015). It has also been used to approximate the distance between two genomes (Mangul and Koslicki, 2016) by computing the distance between two distributions of k-mers. To compare heterogeneous string sets, when the strings and their distributions are known, we use edit distance as the cost to transform one string to another. We refer to this as the Earth Mover's Edit Distance (EMED).
In practice, the complete strings of interest and their abundances are often unknown because these strings are only observed as fragmented sequencing reads. It is impossible to exactly compute EMED between the true sets of complete strings from the sequencing reads only.
The challenges posed by incomplete observed sequences can be alleviated by representing the string set using a graph structure. Multiple types of genome graphs have been introduced (Almodaresi et al., 2018;Dilthey et al., 2015;Garrison et al., 2018;Holley and Melsted, 2020;Iqbal et al., 2012;Lee and Kingsford, 2018;Li et al.,  i404 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Bioinformatics, 38, 2022, i404-i412 https://doi.org/10.1093/bioinformatics/btac264 ISCB/ISMB 2022 2020; Minkin et al., 2017;Paten et al., 2017Paten et al., , 2011. For our purposes, a genome graph is a directed multigraph with labeled nodes and weighted edges, along with a source and a sink node. A string is spelled by a source-to-sink path, or s-t path, if it is equal to the concatenation of node labels on the path. We say that a genome graph represents a string set if the union of paths that spells each string in the set is equal to the graph. In other words, a string set can be spelled by a decomposition of the genome graph.
There are several methods that compute the distance between genome graphs (Ebrahimpour Boroojeny et al., 2020;Minkin and Medvedev, 2020;Polevikov and Kolmogorov, 2019). Among those, Graph Traversal Edit Distance (GTED) (Ebrahimpour Boroojeny et al., 2020) is a general measure that can be applied to genome graphs and does not rely on the type of genome graphs nor the knowledge of the true string sets. Given two genome graphs, GTED finds an Eulerian cycle in each graph that minimizes the edit distance between the strings spelled by each cycle.
However, applying GTED on genome graphs representing heterogeneous string sets may overestimate the similarity between these string sets for two reasons. First, since GTED computes the distance between Eulerian cycles in genome graphs, it may align the prefix of a string to the suffix of another string with no additional penalties. We address this challenge by proposing an extension of GTED, called FGTED, which penalizes direct alignment of prefixes of a string with suffixes of other strings.
Second, and more significantly, both FGTED and GTED compute the edit distance between the two string sets represented by each genome graph that are most similar to each other. However, a genome graph that is constructed from sequencing fragments typically is able to represent more than one set of strings (Kingsford et al., 2010;Paten et al., 2018). As a genome graph merges shared sequences into the same node, it creates chains of bubble structures (Zerbino and Birney, 2008) that result in an exponential number of possible paths, and these paths spell a much more diverse collection of strings than the original set. We call the degree to which a genome graph encodes a larger set of strings than the true underlying set the 'expressiveness' of a genome graph. Due to the expressiveness of a genome graph, the Eulerian cycles found by GTED may not spell the true set of strings and the computed distance may be far from the true distance between string sets used to construct the graphs (Fig. 1a).
We prove both that FGTED always produces a distance that is larger than or equal to GTED, and that FGTED computes a metric that is always less than or equal to the EMED between true sets of strings.
However, FGTED and GTED can be quite far from the EMED. To resolve this discrepancy between FGTED and EMED, we define the collection of strings that can be represented by the genome graph as its string set universe, and genome graph expressiveness as the diameter of its string set universe (SUD), which is the maximum EMED between two string sets that can be represented by the graph (Fig. 1b).
Using diameters, we are able to upper-bound the deviation of FGTED from EMED. Additionally, we are able to correct FGTED and more accurately estimate the true string set distance empirically. On simulated TCR sequences, we reduce the average deviation of FGTED from EMED by more than 300%, and increase the correlation between the true and estimated string set distances by 20%. On Hepatitis B virus genomes, we reduce the average deviation by more than 250%.
These results provide the first connection between comparisons of genome graphs that encode multiple sequences and a natural string distance and provide the first formalization of the expressiveness of genome graphs. Additionally, they provide a practical method to estimate and reduce discrepancy between genome graph distances and string set distances.
2 Preliminary concepts 2.1 Strings DEFINITION 1 (Heterogeneous string set). A heterogeneous string set S ¼ fðw 1 ; s 1 Þ; . . . ; ðw n ; s n Þg contains a set of strings, where each string s i is assigned a weight w i 2 ½0; 1 that indicates the abundance of s i in S. We say that the total weight of S is P i2½1;n w i ¼ 1. EDðs 1 ; s 2 Þ is the minimum cost to transform s 1 into s 2 under edit distance (Levenshtein et al., 1966). The set of operations that transforms s 1 to s 2 can be written as an alignment between s 1 and s 2 , or A ¼ alignðs 1 ; s 2 Þ.
The i-th position in A is denoted by A i ½ ¼ c ð1;iÞ c ð2;iÞ ! , where c ða;iÞ is either a gap character '-' or a character in s a .

Earth Mover's Edit Distance
To find a distance between two heterogeneous string sets, we need to take into account not only the distance between pairs of strings, but also the weight, or abundance of each string in the set. When we are comparing two heterogeneous string sets, we are essentially comparing two distributions of strings. Therefore, we propose using the EMD as a natural distance measure. Given two distributions of items (here, strings) and a cost function that quantifies the cost of transforming one item into another, the EMD between the two distributions is the minimum cost to transform one distribution into another. Computing EMD can be viewed as a transportation problem that finds a many-to-many mapping between two sets of items and minimizes the total cost of the mapping (Rubner et al., 2000;Wasserstein et al., 1969).
Given two heterogeneous string sets S 1 ¼ fðw 1 ; s 1 Þ; . . . ; ðw n ; s n Þg and S 2 ¼ fðw nþ1 ; s nþ1 Þ; . . . ; ðw m ; s m Þg, to compute the EMED, we use the edit distance between s i and s j as the cost of transforming one string to another. Following procedures to compute EMD (Rubner et al., 2000) as a min-cost max-flow problem, we find a mapping M, where Mðs i ; s j Þ is the amount of s i 2 S 1 to be transformed into si2S1 Mðs i ; s j Þ Á EDðs i ; s j Þ. We define that EMEDðS 1 ; S 2 Þ ¼ min M costðMÞ. A valid flow network typically has more than one flow decomposition. Let the set of possible flow decompositions of G be D G .

Genome graphs
There are many variants of genome graphs used for various purposes and in various settings. Here, we introduce the definition of genome graphs we will use.
DEFINITION 4 (Genome graph). A genome graph G ¼ ðV; E; l; wÞ is a valid flow network with node set V, edge set E, node labels l(u) for each u 2 V and edge weights w(e) for each e 2 E. A genome graph contains a source node s and a sink node t, and lðsÞ ¼"$", lðtÞ ¼"#", where $ and # are special characters that do not appear in any string set considered in the scope of this manuscript.
Define operator SðÁÞ that transforms a set of paths in a genome graph G to a set of strings by concatenating the node labels on each path. SðPÞ ¼ fðconcatðpÞ; wðpÞÞjp 2 Pg is a heterogeneous string set where the weight of each string is equal to the weight of the path that spells the string.
DEFINITION 5 (String set represented by a genome graph). A genome graph G represents a string set S if there exists a decomposition DðGÞ 2 D G , such that SðDðGÞÞ ¼ S.
We use G ¼ GðSÞ to denote when G represents S.
DEFINITION 6 (String set universe represented by a genome graph). The string set universe SUðGÞ of a genome graph G is the collection of all heterogeneous string sets that can be represented by G. Formally, SUðGÞ ¼ fSðDÞjD 2 D G g.

Alignment graph
An alignment graph is used to align two genome graphs (Ebrahimpour Boroojeny et al., 2020) and can be viewed as a graph product between two genome graphs. A special case of the alignment graph (Jain et al., 2020) is used to align a string to a graph where the string is represented as a graph with only one path. We assume that the genome graphs to be aligned are transformed so that the label of each node contains only one character.
DEFINITION 7 (Alignment graph). Given genome graphs G 1 ¼ ðV 1 ; E 1 ; l 1 ; w 1 Þ and G 2 ¼ ðV 2 ; E 2 ; l 2 ; w 2 Þ, an alignment graph AGðG 1 ; G 2 Þ ¼ ðV A ; E A ; cost; wÞ is a directed graph with node set V A , edge set E A , edge cost cost(e) and edge weight w(e) for each edge e 2 E A . The alignment graph is defined following the steps: • V A is constructed by adding pairings of nodes in V 1 and V 2 ; that is • For each edge ðu 1 ; v 1 Þ 2 E 1 and ðu 2 ; v 2 Þ 2 E 2 , where ðu 1 ; u 2 Þ 2 V A and ðv 1 ; v 2 Þ 2 V A , create three types of edges: The cost of an in/del edge and a mismatch edge is equal to a customized penalty. The cost of a match edge is equal to zero. A match/mismatch edge should be distinguished with an in/del edge if the corresponding edge in one of the input graphs is a self-loop.
Each edge e ¼ ððu 1 ; u 2 Þ; ðv 1 ; v 2 ÞÞ in an alignment graph can be projected onto one edge in each of the input graphs. An edge in each of the input graphs can also be projected onto a set of edges in AG. The size of AG is OðjE 1 j Á jE 2 jÞ and thus the time to construct AG is quadratic in the sizes of the input genome graphs.
DEFINITION 8 (Projection function). Define the projection function as P ðG;HÞ ðeÞ ¼ E 0 that maps an edge e from graph G to a set of edges E 0 in graph H. The projection function maps an edge in the alignment graph to the edges in the input graphs that are matched together by that edge. It also maps an edge in one of the input graphs to a set of edges in the alignment graph where it is matched with other edges in another input graph. Specifically: Projection from alignment graph to one of the input graphs is defined by P ðAG;GiÞ ððu 1 ; u 2 Þ; ðv 1 ; v 2 ÞÞ ¼ fðu i ; v i Þg; i 2 f1; 2g: Projection from one of the input graphs to alignment graph is defined by Given a set of paths P in AG, we use P ðAG;GiÞ ðPÞ to denote the projection of P onto G i , where P ðAG;GiÞ ðPÞ ¼ fP ðAG;Gi Þ ðpÞjp 2 Pg and the projection For convenience, we define that f i ðDðAGÞÞ ¼ SðP ðAG;GiÞ ðDðAGÞÞÞ, which is the set of strings spelled by a path decomposition in AG that is projected onto G i .

Graph Traversal Edit Distance
GTED, proposed by Ebrahimpour Boroojeny et al. (2020), is a distance between two labeled graphs which are assumed to be Eulerian graphs. Given a genome graph in our definition, we add an edge directing from sink to source with weight equal the sum of edge weights that are directing from the source node in order to make an Eulerian graph.
Let the language of G; LðGÞ, be the set of strings spelled by Eulerian cycles in G. Formally, LðGÞ ¼ fSðcÞjc is an Eulerian cycle in Gg.
DEFINITION 9 (Graph Traversal Edit Distance (Ebrahimpour Boroojeny et al., 2020)). Let G 1 and G 2 be two Eulerian graphs, where the weights on the edges are seen as the number of times an edge must be visited in each Eulerian cycle. Then, EDðs 1 ; s 2 Þ: GTED finds one Eulerian cycle in each genome graph such that the edit distance between the strings spelled by the Eulerian cycles is minimized. GTED is computed by solving a linear programming (LP) formulation (Equations (1)-(4)) on the alignment graph AGðG 1 ; G 2 Þ, which minimizes the cost of a flow in the graph with the flow conservation (Equation (4)) and flow coverage constraints (Equations (2) and (3)). The LP formulation is as follows: Ebrahimpour Boroojeny et al. (2020) prove that GTED is equal to the optimal solution of this LP formulation, and thus GTED is computable in polynomial time. The number of constraints in the above LP is linear in the size of the alignment graph and thus quadratic in the size of the input genome graphs.

An extension of GTED
GTED was originally used to compare genome graphs that are assumed to contain single genomes. It is therefore intuitive that each string represented by the genome graph is spelled with an Eulerian cycle. This property follows the property of assembly graphs (Pevzner et al., 2001). When the genome graph represents more than one string, finding a string spelled by an Eulerian cycle c in the graph is equivalent to finding a concatenation of a permutation of strings in a string set. When aligning two Eulerian cycles, c 1 and c 2 , from input graphs, the boundaries between strings are ignored and the prefix of one string may be aligned to the suffix of another string with no cost. However, such alignment is not allowed when we align sets of strings using EMED.
We propose an extension of GTED with a modified cost function in edit distance computation so that the cost of aligning the sink character # with any other character is infinity. Figure 2a shows an example of the alignment graph built from two input graphs using the proposed cost function. Let the sink nodes in G 1 and G 2 be t 1 and t 2 , and the source nodes be s 1 and s 2 , respectively. After removing all the alignment edges with infinite costs, there is an edge to the alignment node (t 1 , t 2 ) in AG if and only if there exists an edge (u 1 , t 1 ) in G 1 and an edge (u 2 , t 2 ) in G 2 . The only incoming edge to (s 1 , s 2 ) is ððt 1 ; t 2 Þ; ðs 1 ; s 2 ÞÞ. We refer to the edge ððt 1 ; t 2 Þ; ðs 1 ; s 2 ÞÞ as the sink-to-source edge, or t-s edge in alignment graph in the rest of the article.
We let Flow-GTED, or FGTED, denote the distance computed using the alignment graph after removing all infinity cost edges.
FGTED assumes that the input genome graphs are flow networks that represent string sets, which can be seen as an analog to Eulerian tours in the graphs that are used as input for GTED. Sink-to-source edges are added to transform flow networks into Eulerian graphs such that FGTED can be reduced to GTED. As FGTED solves a similar LP formulation as GTED that is constructed on a slightly smaller alignment graph, FGTED is also solvable in polynomial time.
PROOF. Since FGTED is computed on a smaller alignment graph that contains fewer edges than that for computing GTED, FGTED explores a smaller solution space than GTED in solving the LP formulation. Therefore, any feasible solution to the LP formulation for FGTEDðG 1 ; G 2 Þ is a feasible solution to the LP formulation for GTEDðG 1 ; G 2 Þ. Since GTEDðG 1 ; G 2 Þ minimizes the objective, the theorem is true. ٗ Edges in G 1 and G 2 that are highlighted with matching colors are projections from edges in AG 0 to G 1 and G 2 , respectively. Path ð$; A; T; #Þ 2 G 1 is aligned to ð$; A; T; #Þ 2 G 2 and path ð$; A; C; #Þ 2 G 1 is aligned to ð$; A; #Þ 2 G 2 . The weights on AG and AG 0 edges are omitted for simplicity 4 The relationship between GTED, FGTED and EMED Let AG Ã be the alignment graph after removing the t-s edge and all the edges from fejx e ¼ 0; e 2 E A g from the LP solution to Equations (1)-(4). We say that AG Ã is a solution of FGTED. Due to constraints (2)-(4), AG Ã is a valid flow network. Let DðAG Ã Þ be a flow decomposition in AG Ã . Similar to the Eulerian cycles found during the GTED computation, each path in DðAG Ã Þ can be projected to a path in G 1 and a path in G 2 . Denote S Ã 1 ¼ f 1 ðDðAG Ã ÞÞ ¼ SðP ðAG Ã ;G1Þ ðDðAG Ã ÞÞÞ as the set of strings spelled by the set of projected paths from a decomposition DðAG Ã Þ to G 1 . Similarly, S Ã 2 ¼ f 2 ðDðAG Ã ÞÞ ¼ SðP ðAG Ã ;G2Þ ðDðAG Ã ÞÞÞ. We show an example of path projections in Figure 2b.
Observing that we can do flow decomposition in both the FGTED solution and input genome graphs, we will show in this section that FGTED can be bounded by EMED between decompositions in the input genome graphs and in the alignment graph solutions.
THEOREM 2. Given two sets of strings S 1 and S 2 , and genome graphs representing these string sets, where AG Ã is the solution obtained from FGTEDðG 1 ; G 2 Þ. ٗ The proof of this theorem is completed in two parts. The first inequality is proven in Section 4.1 and the second is proven in Section 4.2. Since FGTED computes a distance that is larger than GTED between the same pair of genome graphs (Theorem 1), Theorem 2 also shows that FGTED always estimates the distance between true string sets more accurately than GTED.

FGTED is always less than or equal to EMED
We show in this section that FGTED can be expressed in terms of EMED between string sets constructed from decomposing AG Ã . In other words, while GTED finds an Eulerian tour in each input graph, FGTED finds a flow decomposition in G 1 and G 2 , respectively, that minimizes the EMED between them. Analogous to Definition 9, we have: THEOREM 3. Given two genome graphs G 1 and G 2 , DðG2Þ 2 DG 2 EMEDðSðDðG 1 ÞÞ; SðDðG 2 ÞÞÞ Theorem 3 allows us to define FGTED as the minimum EMED between flow decompositions in input graphs. To prove Theorem 3, we first explore the relationship between an s-t path in AG Ã and the strings spelled by the projections of this path onto G 1 and G 2 .
Let AG 0 ¼ AG Ã p [ p 0 . Both p and p 0 can be found in AG, and both p and p 0 can be constructed by the alignment of the same pair of strings. Therefore, AG 0 is also a valid flow network and a feasible solution to FGTED. Since AG Ã is the optimal solution to FGTED; costðAG Ã Þ costðAG 0 Þ, and costðAG Ã Þ À costðAG 0 Þ 0 ) wðpÞ Á ðcostðpÞ À costðp 0 ÞÞ 0 ) wðpÞ Á ðcostðpÞ À EDðs 1 ; s 2 ÞÞ 0 ) costðpÞ EDðs 1 ; s 2 Þ: ٗ We have shown that the cost of an s-t path in AG Ã is equal to the edit distance between its projections onto input graphs. Using this lemma, we can transform an optimal FGTED solution into an EMED solution.
LEMMA 2. Given S Ã 1 and S Ã 2 obtained from any decomposition DðAG Ã Þ 2 D AG Ã , where costðAG Ã Þ is sum of edge costs in the solution alignment graph to FGTED.
PROOF. We prove in two directions.
( direction) We construct a mapping M between strings in S Ã 1 and S Ã 2 from the decomposition DðAG Ã Þ, where Mðs i ; s j Þ is the portion of s i 2 S 1 and s j 2 S 2 that are aligned. For each p 2 DðAG Ã Þ, we obtain s 1 and s 2 as strings constructed from projections of p onto G 1 and G 2 and increment the weight of mapping Mðs 1 ; s 2 Þ by w(p). After iterating through all paths in DðAG Ã Þ, the cost of M is M is also a feasible solution to the LP formulation of EMED. Since EMED minimizes the cost of mapping between S Ã 1 and S Ã 2 ; EMEDðS Ã 1 ; S Ã 2 Þ costðAG Ã Þ. (! direction) We construct a valid flow network, AG 0 using an optimal solution to EMEDðS Ã 1 ; S Ã 2 Þ. For each pairing (s i , s j ) for s i 2 S 1 and s j 2 S 2 , we obtain its weight w and cost c from the EMED solution. Let A ¼ alignðs i ; s j Þ be an optimal alignment under edit distance, and cost(A) ¼ c. We then add a path corresponding to A with weight w in AG 0 . This follows the same procedure in the proof of Lemma 1. After adding all paths, we obtain AG 0 with costðAG 0 Þ ¼ EMEDðS Ã 1 ; S Ã 2 Þ. Since costðAG Ã Þ is minimized by FGTED, costðAG 0 Þ ! costðAG Ã Þ ) EMEDðS Ã 1 ; S Ã 2 Þ ! costðAG Ã Þ. ٗ Lemma 2 provides a transformation algorithm between optimal solutions to EMED and solutions to FGTED. Using Lemma 2, we can show that the EMED between S Ã 1 and S Ã 2 constructed from any decomposition in AG Ã is equal to the decompositions of G 1 and G 2 that are closest in terms of EMED. LEMMA 3. Given S Ã 1 and S Ã 2 obtained from any decomposition DðAG Ã Þ 2 D AG Ã , In Lemma 2, S Ã 1 and S Ã 2 can be constructed from decomposing G 1 and G 2 . Suppose for contradiction that there exists a decomposition that constructs string sets S 0 1 and S 0 2 , such that EMEDðS Ã 1 ; S Ã 2 Þ > EMEDðS 0 1 ; S 0 2 Þ. Following the procedure in the proof of Lemma 2, we can construct a feasible solution to FGTED with cost equal to EMEDðS 0 1 ; S 0 2 Þ, which is less than costðAG Ã Þ ¼ EMEDðS Ã 1 ; S Ã 2 Þ. This contradicts with the assumption that FGTED minimizes costðAG Ã Þ. ٗ Theorem 3 is therefore true because of Lemmas 2 and 3. Using Theorem 3, we are able to prove the first inequality in Theorem 2 with Lemma 4. LEMMA 4. Given heterogeneous string sets S 1 and S 2 and genome graphs representing these string sets, G 1 ¼ GðS 1 Þ and G 2 ¼ GðS 2 Þ, FGTEDðG 1 ; G 2 Þ EMEDðS 1 ; S 2 Þ.

PROOF. Given Theorem 3, FGTED finds flow decomposition in D G1
and D G2 that minimizes the EMED between them. Since S 1 and S 2 can be constructed from a flow decomposition in D G1 and D G2 , respectively, this lemma is true. ٗ

Genome graph expressiveness
A genome graph typically can represent more than one set of strings. We name the collection of string sets representable by a genome graph the string set universe of that genome graph, or SUðGÞ. Using Theorem 3, we can say that FGTED finds two sets of strings in the string set universe of G that are closest in the metric space of EMED. We define the expressiveness of a genome graph as the diameter of its string set universe, which is the maximum EMED between the string sets in SUðGÞ.

String set universe diameter as an upper bound on deviation of FGTED from EMED
The string set universe diameter gives one measure of the size of SU(G), and it can also be used to characterize the deviation of GTED from EMED. Recall that S Ã 1 and S Ã 2 are string sets obtained from a decomposition DðAG Ã Þ, and that EMEDðS Ã 1 ; S Ã 2 Þ ¼ FGTEDðGðS 1 Þ; GðS 2 ÞÞ, where S 1 and S 2 are true string sets. From Theorem 2, we have that EMEDðS 1 ; S 2 Þ ! EMEDðS Ã 1 ; S Ã 2 Þ. We can bound the deviation of EMEDðS Ã 1 ; S Ã 2 Þ from EMEDðS 1 ; S 2 Þ using triangle inequalities.
PROOF. Both edit distance and EMD are metrics (Levenshtein et al., 1966;Rubner et al., 2000), which means that triangle inequality holds for EMED between strings. Therefore, for any string sets S Ã 1 and S Ã 2 , The above inequality (6) holds for any string sets S Ã 1 and S Ã 2 . To give a tight upper bound on the deviation, we take the minimum over all possible pairs of string sets constructed from decomposing AG Ã that yields inequality (5).
Lemma 5 proves the second inequality of Theorem 2 thus completing the proof for Theorem 2 with Lemma 4.
The upper bound found in Lemma 5 can be used as a factor that evaluates the pairwise expressiveness of two genome graphs. While a genome graph may represent a large universe of string sets, as long as the true string set is close to the 'best' string set in the pair-wise comparison, the deviation of FGTED from EMED is small. We define this upper bound as the String Universe Co-Expansion Factor (SUCEF), which can be used to evaluate the discrepancy between FGTED and EMED.
On the other hand, finding SUCEF not only requires knowledge of true string sets S 1 and S 2 , but SUCEF is also a pair-dependent measure that needs to be calculated for every pair of string sets and corresponding genome graphs. In order to characterize the effect of the expressiveness of individual genome graphs, we derive another upper bound on the deviation of FGTED from EMED using the string set universe diameters.
The sum of string set universe diameters of two genome graphs is an upper bound on SUCEF of these graphs and any two sets of strings they represent. LEMMA 6. Given two genome graphs G 1 and G 2 and two sets of strings S 1 and S 2 they represent, EMEDðS 1 ; S 2 Þ À FGTEDðG 1 ; G 2 Þ SUCEFðS 1 ; S 2 ; G 1 ; G 2 Þ SUDðG 1 Þ þ SUDðG 2 Þ: PROOF. Both S 1 and S Ã 1 are represented by G 1 and belong to SUðG 1 Þ. Therefore, by definition of string set universe diameter, EMEDðS 1 ; S Ã 1 Þ SUDðG 1 Þ as the diameter maximizes the distance between any pair of strings represented by the genome graph. The same holds for EMEDðS 2 ; S Ã 2 Þ SUDðG 2 Þ. ٗ Using Lemma 6, we can bound the deviation of FGTED from EMED using the expressiveness of individual genome graphs even when we do not have the knowledge of ground truth string sets. In practice, we can construct genome graphs using known sequences from the species of interest and form a training set. Using the training set, we can learn the relationship between SUDs and the deviation of FGTED from EMED, and then empirically estimate the anticipated discrepancy between FGTED and EMED. In the following sections, we show that we can improve FGTED using SUDs to obtain reduced anticipated deviation from EMED and stronger correlation with EMED.
5 Practically correcting the discrepancy between FGTED and EMED 5.1 Estimating string set universe diameters The string set universe diameter of a genome graph can be estimated by sampling flow decompositions of the graph. To sample a flow decomposition, we first sample one s-t path. At each node u, we choose the neighbor v with the highest edge weight w(u, v) with probability 0.5 and randomly choose a neighbor otherwise. After sampling a path, we send flow that is equal to the minimum edge weight on that path and produce the residual graph by subtracting the flow from edge weights on that s-t path. We repeat this process on the residual graph until all edge weights are zero. This process assumes that the input genome graphs are acyclic to ensure all edge capacities (weights) are satisfied. If a genome graph is cyclic, e.g. de Bruijn graphs, string sets from SUðG 1 Þ can be obtained by sampling Eulerian cycles in the genome graph, and each string in the string set is obtained by segmenting the sampled Eulerian cycle at source and sink nodes. After sampling 50 pairs of flow decompositions, we construct string sets from sampled flow decompositions and calculate pairwise EMED. We then obtain the highest pairwise EMED and use it as the estimated diameter.

Correcting FGTED using string set universe diameters
Using the sum of SUDs, we empirically estimate the deviation of FGTED from EMED with a linear regression model. We denote the deviation of FGTED from EMED by deviationðS 1 ; S 2 ; G 1 ; G 2 Þ, which is computed as jEMEDðS 1 ; S 2 Þ À FGTEDðGðS 1 Þ; GðS 2 ÞÞj. The linear regression model, LR, has the following form where a is the coefficient of the model and b is the intercept. The fitted model will minimize the mean squared error between predicted deviation and true deviation in the training set.
The corrected FGTED for each pair of graphs is calculated using the learned linear regression model as follows.
The deviation of corrected FGTED from EMED has the same form as the deviation of uncorrected FGTED from EMED.

Data
We evaluate the use of string set universe diameters on two sequence sets: 1. Simulated T-cell receptor (TCR) repertoire. We simulate 50 sets of TCR sequences and assign weights to each sequence using reference gene sequences of V, D and J genes from Immunogenetics (IMGT) V-Quest sequence directory (Lefranc and Lefranc, 2001). The number of sequences in each set varies from 2 to 5. We then generate 225 pairs of TCR string sets. Each TCR sequence is about 300 base pairs long. See Supplementary Materials for detailed simulation process. 2. Hepatitis B virus (HBV) genomes. We collect 9 sets of HBV genomes from three hosts-humans, bats and ducks-from the NCBI virus database (Hatcher et al., 2017). We build 36 pairs of HBV string sets. See Supplementary Materials for detailed string set construction process.
We construct a partial order multiple sequence alignment (MSA) graph on each string set (Lee et al., 2002). We first conduct MSA for each string set using Clustal Omega (Sievers et al., 2011). Then for each column of the MSA, we create a node for each unique character and add an edge between two nodes if the characters in node labels are adjacent in the input strings at that column. For each consecutive stretch of gap characters, no nodes are created, but an edge is added between flanking columns of the stretch of gaps. We also create a source node and a sink node that are connected to nodes representing the first and last characters of the input strings. The MSA graphs created in this process are all acyclic. We compute FGTED on MSA graphs by adding sink-to-source edges.
We also construct a de Bruijn graph (Pevzner et al., 2001) with k-mer size equal to 4 on TCR sequence sets, which we refer to as dBG4 in the following sections. This k-mer size is reasonable as compared to the average lengths of TCR sequences which is 350 base pairs and allows us to experiment with graphs that are expected to have higher expressiveness. In dBG4, each node corresponds to a k-mer, S½i : i þ k, where S is a string from the ground truth string set, S. Each edge corresponds to the overlap between two k-mers, S½i : i þ k and S½i þ 1 : i þ k þ 1 for any S 2 S. In order to construct the alignment graph, we process the de Bruijn graphs such that each node represents one character. We add a source node and a sink node to each dBG4 and connect them to nodes that represent the first and the last character in each string, respectively.
All of used sequences and constructed genome graphs can be found in the GitHub repository at https://github.com/Kingsford-Group/gtedemedtest/.

FGTED deviates from EMED as the expressiveness of the genome graph increases
We compute EMED and FGTED on string set pairs and genome graph pairs. The alignment graphs are constructed using one thread, which on average takes 6 s for dBG4s, 8 s for each MSA graph on TCR sequences, 9.43 min for each MSA graph on HBV genomes. Optimization for LP with 10 threads takes on average 601 s for each dBG4, 1 h for each MSA graph of TCR sequence sets and 4 h for each MSA graph on HBV genomes ( Supplementary Fig. S1).
We show that the deviation of FGTED from EMED is higher on genome graphs that are more expressive. We compare the FGTED computed on dBG4s and MSA graphs constructed with TCR sequences and the diameters of two types of graphs. DBG4 represents all sequences with the same 5-mer distributions as the ground truth sequences. Therefore, as expected, we observe larger sampled SUDs from dBG4 than the MSA graphs (Fig. 3a). The deviation of FGTED from EMED is also larger with dBG4s than the MSA graphs (Fig. 3b). This further illustrates the effect of graph construction approaches on the resulting expressiveness.

Corrected FGTED more accurately estimates distance between unseen string sets encoded with genome graphs
For each pair of string sets, we obtain the deviation of FGTED from EMED and sum of estimated SUDs. We fit three linear regression models, LR dBG4 , LR TCR and LR HBV , to predict deviation from sum of SUDs on simulated TCR sequences and HBV genomes of different types of graphs separately.
We evaluate the corrected and uncorrected FGTED by performing Pearson correlation experiments. We fit LR models on half of the data and compute the corrected FGTED on the other half as the test set. We evaluate the correlation between corrected and uncorrected FGTED and EMED on the test set. Two-tail Pvalues are calculated for each correlation experiment to test for non-correlation.
The LR models are evaluated with 10-fold cross validation. We randomly permute and split data into 10 equal parts. In each of the 10 iterations, we use one part as the test set and the rest as the training set. An average deviation is calculated across all iterations.
In Tables 1 and 2, we show that using string set universe diameters, we are able to improve the correlation between FGTED and EMED on MSA graphs of both the simulated TCR sequences and HBV genomes. On dBG4s, the correlation is reduced slightly by the correction. All Pearson correlation experiments are statistically significant with P-values <0.01, which tests for the probability of noncorrelation. On HBV genomes, since the correlation between uncorrected FGTED and EMED is approaching 1, no significant improvement is observed. On the other hand, significant reduction in average deviation is observed on both types of data. We are able to reduce the average deviation from 77.29 to 19.08 on de Bruijn Graphs with TCR sequences, from 32.74 to 9.13 on MSA graphs containing simulated TCR sequences and from 140.12 to 54.87 on HBV genomes.
One caveat of using SUDs for correcting distances between genome graphs is that this correction is not guaranteed to always improve the distance. Given two string sets, there is usually an adversarial worst case where adjusting the distance using this approach reduces the accuracy in estimating string sets distances. When EMED between true string sets are small, the corrected FGTED may overestimate the EMED and result in a larger deviation. Nevertheless, we show that corrected FGTED reduces the anticipated deviation from EMED.

Discussion
A genome graph's string set universe diameter (SUD) provides information on the size and diversity of the represented string sets. We show that we can use SUDs to practically characterize the discrepancy between FGTED and EMED and to obtain a more accurate distance between unseen string sets encoded in genome graphs on average. While the results are obtained on short genomic sequences due to the high computational cost of FGTED and GTED, this result is encouraging.
The corrected FGTED can be used to compute a more accurate distance between heterogeneous samples represented by genome graphs in applications such as immune repertoire analysis and cancer subtyping. This opens up avenues for more comprehensive heterogeneous sample comparison methods. However, FGTED, as well as GTED, is not scalable to mammalian genomes due to the quadratic size of the alignment graph and time it takes to solve the LP formulations. Algorithms that compute FGTED faster or efficient approximation genome graph comparison methods (Minkin and Medvedev, 2020;Polevikov and Kolmogorov, 2019) are needed for comparing large heterogeneous string sets.
SUDs may also be used to characterize the diversity of strings represented by reference genome graphs that are used in sequenceto-graph alignment (Rautiainen and Marschall, 2020;Sir en et al., 2020). In sequence-to-graph alignment, it is often desired that a more diverse set of strings than the original reference string set is represented by the graph. Here, SUDs could be used as a measure to control the right amount of variation in the string set universe of created genome graphs.
Another future direction is to use expressiveness as a regularization term in the objective function to construct better genome graphs. To ensure efficiency of genome graphs in storing sequences, we can construct genome graphs that minimize their sizes (Pandey et al., 2021;Qiu and Kingsford, 2021). However, reducing the size of a genome graph may result in graphs that are highly expressive, and the distance between these genome graphs will deviate further from distances between true string sets. Adding a SUD term to the objective may address this problem.  Note: Pearson correlation is calculated on a held-out set of data for both simulated TCR and HBV that consist of 50% of data, and LR model is fit on the other half. Bold font numbers identify the better performing method with higher Pearson correlation with EMED on each type of data. Note: The average deviation is calculated over a 10-fold cross-validation of the LR model. Bold font numbers identify the better performing method with lower average deviation from EMED on each type of data.